Résumé

A nonparametric goodness-of-fit test for univariate regression.

In the setting of univariate regression, we propose a general nonparametric tool for testing
whether a known function m is a good candidate for the regression function in view of the data. The test is based
on the maximal length of the sequences, ordered with respect to the covariate, of residuals having the same sign. No
assumption is made on the homoscedasticity of the errors. Moreover, the test does not require replicated
data. We give the distribution of the test statistic under the null hypothesis that the function m under consideration is the
true regression function, as well as under a certain class of alternative hypotheses. To cite this article:
A. Nom1, A. Nom2, C. R. Acad. Sci. Paris, Ser. I 340 (2005).

Abstract 

A simple test is proposed for examining the correctness of a given completely specified response function against 
unspecified general alternatives in the context of univariate regression. The usual diagnostic tools based on residual
plots are useful but heuristic. We introduce a formal statistical test supplementing the graphical analysis.
Technically, the test statistic is the maximum length of the sequences of ordered (with respect to the covariate) 
observations that are consecutively overestimated or underestimated by the candidate regression function. Note 
that the testing procedure can cope with heteroscedastic errors and no replicates. Recursive formulae allowing 
to calculate the exact distribution of the test statistic under the null hypothesis and under a class of alternative 
hypotheses are given. To cite this article: A. Nom1, A. Nom2, C. R. Acad. Sci. Paris, Ser. I 340 (2005).
1. Introduction



Regression is one of the most widely used statistical tools to examine how one variable is related to another. 
Statisticians usually begin their work by proposing a model for their observations. Then, they have to
check whether this model is correct. The graphical analysis of the residuals is an important step of this
process since the detection of a systematic pattern would indicate a misspecified model. Unfortunately, 
this procedure is heuristic and could lead to errors of interpretation since it is often difficult to determine 
whether the observed pattern reflects model misspecification or random fluctuations. So it is of interest to 
complement such an analysis by a formal test. A large literature in this area can be found in Hart (1997). 
A review of statistical tests and procedures to determine lack of fit associated with the deterministic 
portion of a proposed linear regression model is presented in Neill and Johnson (1984). We propose a
new approach based on the maximum length of the sequences of consecutive observations that are overestimated
(or underestimated) by the model. This test is very simple and can be computed visually if the sample size is
small enough. It is a modification of a nonrandomness test (see Bradley 1968, chap. 11). In other
words, we use it to detect whether the residuals are randomly distributed or not.

In Section 2, the Length of the Longest Run Test is presented. Section 3 is devoted to the law of the test 
statistic under the null hypothesis. In Section 4, the power of the test for a class of fixed alternatives is 
given. 



2. The Length of the Longest Run Test Statistic 

Consider a collection of n random variables $Y_i$ generated as

$$Y_i = m_0(x_i) + \varepsilon_i, \quad i = 1, \ldots, n,$$

where the $x_i$ are fixed design points and $m_0$ is the true regression function. Moreover, the $\varepsilon_i$ are independent
and centered random variables such that:

$$\forall i = 1, \ldots, n, \quad \Pr(\varepsilon_i > 0) = \Pr(\varepsilon_i < 0) = \tfrac{1}{2}. \qquad (1)$$

Note that no hypothesis is made on the regularity of the function $m_0$ or on the fact that errors must
be identically distributed or homoscedastic, and that normality of the $\varepsilon_i$ implies Condition (1). Moreover,
contrary to other classical tests (like the F-test), no replicates are needed to compute our test statistic.
We address the problem of testing the null hypothesis

$$H_0 : m_0 = m \quad \text{vs.} \quad H_1 : m_0 \neq m,$$

where m is a completely specified function. 

The i-th residual, $\hat{\varepsilon}_i$, may be seen as a substitute for the realisation of the random variable $\varepsilon_i$, thus comprising
clues for adequacy or inadequacy of the model assumptions related to the distribution of $\varepsilon_i$. Some
classical lack-of-fit test statistics are based on squared residuals, hence their signs are neglected, and we
can expect to lose some information. We propose a test statistic that takes these signs into account.
This test statistic, $L_n$, is the maximum length of the sequences of ordered (with respect to the covariate)
observations that are consecutively overestimated (or underestimated) by the candidate m. Formally, we
define $Z_i := \mathbb{1}_{\{\hat{\varepsilon}_i > 0\}}$, $1 \le i \le n$, $S_0 := 0$, $S_i := Z_1 + \ldots + Z_i$, and put, for $1 \le K \le n$,

$$I^+(n,K) := \max_{0 \le i \le n-K} (S_{i+K} - S_i).$$
Let $L_n^+$ be the largest integer K for which $I^+(n,K) = K$. $L_n^+$ is the length of the longest run of 1's in
$Z_1, \ldots, Z_n$, i.e. the length of the longest run of positive residuals. By analogy, we define $L_n^-$ as the length
of the longest run of 0's in $Z_1, \ldots, Z_n$, that is, the largest integer K for which $I^-(n,K) = K$, where

$$I^-(n,K) := \max_{0 \le i \le n-K} (K - S_{i+K} + S_i).$$

Clearly, $L_n^-$ is the length of the longest run of negative residuals. Finally, we define $L_n := \max(L_n^+, L_n^-)$.
For a fixed nominal level $\alpha > 0$, we obtain the following one-sided rejection regions $W_{n,\alpha} = \{L_n > c_{n,\alpha}\}$,
where $c_{n,\alpha}$ is the largest integer such that $\Pr(L_n \ge c_{n,\alpha}) > \alpha$, so that $\Pr(L_n > c_{n,\alpha}) \le \alpha$. The corresponding two-sided rejection
regions are $W_{n,\alpha} = \{L_n \notin [c_{n,1-\alpha/2}, c_{n,\alpha/2}]\}$.
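
As a minimal sketch of how the statistic could be computed in practice, the following Python function orders the observations by the covariate, takes the sign of each residual with respect to the candidate m, and returns the lengths of the longest runs. The function name `longest_runs` and the toy data are illustrative assumptions, not part of the original procedure; a residual equal to zero is counted as non-positive, in line with the indicator of a strictly positive residual used above.

```python
from itertools import groupby

def longest_runs(x, y, m):
    """Return (L_plus, L_minus, L_n) for a candidate regression function m."""
    # Order the observations by the covariate, then keep only the residual signs.
    pairs = sorted(zip(x, y))
    signs = [yi - m(xi) > 0 for xi, yi in pairs]
    # groupby splits the sign sequence into maximal runs of equal values.
    l_plus = max((len(list(run)) for s, run in groupby(signs) if s), default=0)
    l_minus = max((len(list(run)) for s, run in groupby(signs) if not s), default=0)
    return l_plus, l_minus, max(l_plus, l_minus)

# Toy data (hypothetical): candidate m(t) = 2t, ordered residual signs + + + - + -
x = [1, 4, 2, 9, 6, 7]
y = [3, 9, 5, 17, 10, 15]
print(longest_runs(x, y, lambda t: 2 * t))  # → (3, 1, 3)
```

With such a small sample the same value can be read directly off a residual plot, which is the sense in which the test can be computed visually.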



3. Distribution of Ln under the null hypothesis 

If m is equal to $m_0$, then the residuals are the true errors $\varepsilon_i$. Since Condition (1) holds, we can apply
the following recursive formula (Riordan (1958), p. 153, Problem 13):

$$\begin{aligned}
(n-1)!\,\Pr(L_n = k) ={}& 2(n-2)!\,\Pr(L_{n-1} = k) - (n-k-2)!\,\Pr(L_{n-k-1} = k) \\
&+ (n-2)!\,\Pr(L_{n-1} = k-1) - 2(n-3)!\,\Pr(L_{n-2} = k-1) \\
&+ (n-k-1)!\,\Pr(L_{n-k} = k-1).
\end{aligned}$$

By using $\Pr(L_2 = 2) = 1/2$ and, $\forall n > 0$, $\Pr(L_n = 1) = 1/2^{n-1}$, the entire exact distribution of $L_n$ and
critical values for every nominal level can be deduced from the above formula.
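
The null distribution can also be cross-checked numerically. The sketch below does not implement the recursion quoted above; it assumes only that, under the null hypothesis and Condition (1), the n residual signs behave like independent fair coin flips (no zero residuals), and it counts sign sequences directly by cutting the n positions into runs of length at most x. The names `null_cdf`, `null_pmf` and `critical_value` are hypothetical helpers, and the critical value is taken here as the smallest c with Pr(L_n > c) <= alpha, which is one reading of the rejection rule.

```python
from fractions import Fraction

def null_cdf(n, x):
    """Pr(L_n <= x) for n >= 1 independent fair signs (assumption stated above)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if x < 1:
        return Fraction(0)
    # ways[m] = number of ways to cut m positions into consecutive runs of length 1..x
    ways = [1] + [0] * n
    for m in range(1, n + 1):
        ways[m] = sum(ways[m - j] for j in range(1, min(x, m) + 1))
    # Each list of run lengths matches two sign sequences (first run + or first run -).
    return Fraction(2 * ways[n], 2 ** n)

def null_pmf(n, k):
    """Pr(L_n = k) under the same assumption."""
    return null_cdf(n, k) - null_cdf(n, k - 1)

def critical_value(n, alpha):
    """Smallest integer c such that Pr(L_n > c) <= alpha."""
    c = 1
    while 1 - null_cdf(n, c) > alpha:
        c += 1
    return c

# The two starting values quoted in the text are recovered exactly.
assert null_pmf(2, 2) == Fraction(1, 2)
assert null_pmf(10, 1) == Fraction(1, 2 ** 9)
```

Exact fractions are used so that the tail probability compared with the nominal level is free of rounding error; for large n a floating-point version would be faster.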

In most practical cases of interest, m is estimated. For example, if m is estimated by OLS, an
unfortunate property of the residuals is that they are autocorrelated even when the true errors are white
noise. This divergence from the assumptions disappears in large samples, but may be a problem when
performing diagnostic tests in small samples. One way of handling this problem is to transform the OLS
residuals so that they do satisfy the LS assumptions when these are correct. One of the most common
of these transformations is the so-called recursive residuals (see Kianifard and Swallow (1996) among
others).

Another possibility is to estimate m on a subset of the data and to test it on the rest of the data.

In a coin tossing experiment, $L_n$, $L_n^+$, and $L_n^-$ can be seen as the length of the longest run of heads or
tails, of heads, and of tails, respectively. The length of the longest head run in a coin tossing experiment was
investigated in the early days of probability theory. Later, Deheuvels (1985) gave upper and lower bounds
for this length for a biased coin.

Schilling (1990) discusses the distribution of $L_n$ for unbiased coins, and remarks that for n tosses of a
fair coin the length of the longest run of heads or tails, statistically speaking, tends to be about one longer
than the length of the longest run of heads only. For a biased coin, when n is very large, if heads is more
likely than tails, the distribution function of $L_n^+$ is well approximated by an extreme value distribution
(see Gordon et al. (1986)).
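
The remark about fair coins can be illustrated numerically. The sketch below is an assumption-laden illustration rather than a method taken from the cited works: it counts, for a fair coin, the sequences whose longest run of either side (respectively of heads only) does not exceed x, and compares the two expected lengths. The function names are hypothetical.

```python
from fractions import Fraction

def count_either(n, x):
    """Number of length-n sequences (n >= 1) whose longest run of heads or tails is <= x."""
    if x < 1:
        return 0
    ways = [1] + [0] * n          # ways[m]: cuts of m tosses into runs of length 1..x
    for m in range(1, n + 1):
        ways[m] = sum(ways[m - j] for j in range(1, min(x, m) + 1))
    return 2 * ways[n]            # the first run is either heads or tails

def count_heads(n, x):
    """Number of length-n sequences whose longest run of heads is <= x."""
    a = []
    for m in range(n + 1):
        if m <= x:
            a.append(2 ** m)      # too short to contain a head run longer than x
        else:
            # condition on the j leading heads (0..x) that precede the first tail
            a.append(sum(a[m - 1 - j] for j in range(x + 1)))
    return a[n]

def mean_longest_run(n, counter):
    """E[L] = sum over x >= 0 of Pr(L > x), for a fair coin."""
    return sum(1 - Fraction(counter(n, x), 2 ** n) for x in range(n))

n = 50
gap = mean_longest_run(n, count_either) - mean_longest_run(n, count_heads)
print(float(gap))  # difference between the two expected lengths for n = 50
```

The printed difference can be compared with the "about one longer" rule of thumb for other values of n by changing the last lines.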



4. Distribution of Ln under fixed alternative hypotheses

The distribution of the Length of the Longest Run Test statistic can be calculated under some fixed 
alternative hypotheses. First of all, we suppose that Condition (1) is fulfilled, and that errors are identically 
distributed. 
Moreover, if we test

$$H_0 : \forall x,\ m_0(x) = m(x) \quad \text{vs.} \quad H_{1,c} : \forall x,\ m_0(x) = m(x) + c,\ c \neq 0,$$

then, under $H_{1,c}$, the probability for an observation to be underestimated (respectively, overestimated),
$p(c) \neq \tfrac{1}{2}$, is constant for all the observations. By considering the total number of positive residuals, k, in
the sequence, the cumulative distribution of $L_n$ can be expressed as:

$$\Pr(L_n \le x) = \sum_{k=0}^{n} S_n^{(k)}(x)\, p(c)^k\, (1 - p(c))^{n-k},$$

where $S_n^{(k)}(x)$ is the number of sequences of length n that contain k positive residuals in which the length
of the longest run of positive or negative residuals does not exceed x. Analogously, Schilling (1990) studied
the cumulative distribution of $L_n^+$.
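
For small n, the sum over k can be checked by brute force. The sketch below is illustrative only: `counts_S` enumerates all sign sequences to obtain the counts for every k, `cdf_from_counts` plugs them into the weighted sum, and `cdf_forward` computes the same probability by a forward recursion on the sign and length of the current run. All three names are hypothetical, and p stands for the constant probability of a positive residual.

```python
from fractions import Fraction
from itertools import groupby, product

def counts_S(n, x):
    """Brute-force counts for k = 0..n (n >= 1): sequences with k positives, all runs <= x."""
    counts = [0] * (n + 1)
    for seq in product((0, 1), repeat=n):      # 1 = positive residual
        longest = max(len(list(run)) for _, run in groupby(seq))
        if longest <= x:
            counts[sum(seq)] += 1
    return counts

def cdf_from_counts(n, x, p):
    """Pr(L_n <= x) as the sum over k of count * p^k * (1 - p)^(n - k)."""
    return sum(s * p**k * (1 - p)**(n - k) for k, s in enumerate(counts_S(n, x)))

def cdf_forward(n, x, p):
    """Same probability; pos[r] (neg[r]) = Pr(no run > x yet, current run is r+1 positives (negatives))."""
    if x < 1:
        return 0 * p
    pos = [p] + [0 * p] * (x - 1)
    neg = [1 - p] + [0 * p] * (x - 1)
    for _ in range(n - 1):
        new_pos = [sum(neg) * p] + [q * p for q in pos[:-1]]
        new_neg = [sum(pos) * (1 - p)] + [q * (1 - p) for q in neg[:-1]]
        pos, neg = new_pos, new_neg
    return sum(pos) + sum(neg)

p = Fraction(7, 10)
assert cdf_from_counts(8, 3, p) == cdf_forward(8, 3, p)
```

The enumeration grows like 2^n and is only meant as a check; the forward recursion stays cheap for larger samples.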

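In the following Proposition, we give a recursive formula to compute the $S_n^{(k)}(x)$:
Proposition 4.1 Let n and x be such that $0 < x < n$. Then,

(i) If $n-k \le x$ and $k \le x$, $S_n^{(k)}(x) = C_n^k$.

(ii) If $n-k \le x$ and $k > x$, $S_n^{(k)}(x) = \sum_{j=0}^{x} S_{n-1-j}^{(k-j)}(x)$.

(iii) If $n-k > x$ and $k \le x$, $S_n^{(k)}(x) = \sum_{j=0}^{x} S_{n-1-j}^{(k-1)}(x)$.

(iv) If $n-k > x$ and $k > x$, let

$$R_n^{(k)}(x) := \sum_{j \ge 0} \sum_{i=1}^{x} \Big( S_{n-1-2j(x+1)-i}^{(k-j(x+1)-i)}(x) + S_{n-1-2j(x+1)-i}^{(k-1-j(x+1))}(x) - S_{n-1-(2j+1)(x+1)-i}^{(k-(j+1)(x+1))}(x) - S_{n-1-(2j+1)(x+1)-i}^{(k-1-j(x+1)-i)}(x) \Big), \qquad (2)$$

with the following conventions: $\forall x \in \mathbb{N}^*$, $S_0^{(0)}(x) = 1$ and $\forall n \in \mathbb{N}^*, k \in \mathbb{N}^*$, $S_{-n}^{(k)}(x) = S_n^{(-k)}(x) = S_{-n}^{(-k)}(x) = 0$. Finally,

- If $\exists\, (i,j) \in \{1, \ldots, x\} \times \mathbb{N}^*$ such that $(n,k) = (2j(x+1)+i,\ j(x+1))$ or $(n,k) = (2j(x+1)+i,\ j(x+1)+i)$, then $S_n^{(k)}(x) = R_n^{(k)}(x) + 1$.

- If $\exists\, (i,j) \in \{1, \ldots, x\} \times \mathbb{N}^*$ such that $(n,k) = ((2j+1)(x+1)+i,\ j(x+1)+i)$ or $(n,k) = ((2j+1)(x+1)+i,\ (j+1)(x+1))$, then $S_n^{(k)}(x) = R_n^{(k)}(x) - 1$.

- Else, $S_n^{(k)}(x) = R_n^{(k)}(x)$.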



From this result, one can deduce the exact law of the test statistic under $H_{1,c}$, and the power of the test
follows. In the next Proposition, we show that, for n large enough, the distribution function of $L_n$ is well
approximated by the distribution function of $L_n^+$ (or $L_n^-$, depending on the value of $p(c)$):
Proposition 4.2 If $\forall i = 1, \ldots, n$, $\Pr(\varepsilon_i > 0) = p(c)$, $p(c) > \tfrac{1}{2}$ (resp. $p(c) < \tfrac{1}{2}$), then

$$\forall k, \quad \Pr(L_n \le k) = \Pr(L_n^+ \le k) + o(1) \quad \text{when } n \to \infty$$

(resp. $\Pr(L_n \le k) = \Pr(L_n^- \le k) + o(1)$).
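
The approximation can be explored numerically. The sketch below is an illustration under stated assumptions, not the argument used in the proof: signs are taken independent with a constant probability p of being positive, `prob_runs_bounded` (a hypothetical helper) returns the probability that no positive run exceeds `max_pos` and no negative run exceeds `max_neg`, and `max_gap` reports the largest difference over k between the distribution function of the longest positive run and that of the longest run of either sign.

```python
def prob_runs_bounded(n, p, max_pos, max_neg):
    """Pr(no run of positives > max_pos and no run of negatives > max_neg), n >= 1 signs."""
    pos = [0.0] * (max_pos + 1)   # pos[r]: current run consists of r positives (r >= 1)
    neg = [0.0] * (max_neg + 1)   # neg[r]: current run consists of r negatives (r >= 1)
    if max_pos >= 1:
        pos[1] = p
    if max_neg >= 1:
        neg[1] = 1 - p
    for _ in range(n - 1):
        new_pos = [0.0] * (max_pos + 1)
        new_neg = [0.0] * (max_neg + 1)
        if max_pos >= 1:
            new_pos[1] = sum(neg) * p            # a sign change starts a new run
        if max_neg >= 1:
            new_neg[1] = sum(pos) * (1 - p)
        for r in range(2, max_pos + 1):
            new_pos[r] = pos[r - 1] * p          # the same sign extends the run
        for r in range(2, max_neg + 1):
            new_neg[r] = neg[r - 1] * (1 - p)
        pos, neg = new_pos, new_neg
    return sum(pos) + sum(neg)

def max_gap(n, p):
    """Largest value over k of Pr(longest positive run <= k) - Pr(longest run of either sign <= k)."""
    return max(prob_runs_bounded(n, p, k, n) - prob_runs_bounded(n, p, k, k)
               for k in range(1, n + 1))

for n in (20, 100):
    print(n, max_gap(n, 0.7))   # the gap is expected to shrink as n grows when p > 1/2
```

Taking the maximum over k avoids the trivial case where both probabilities vanish for a fixed k as n grows.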



5. Proofs. 



Proof of Proposition 4.1: 

The recursive formula to compute $S_n^{(k)}(x)$, the number of sequences of length n that contain k positive
residuals in which the length of the longest run of positive or negative residuals does not exceed x, is
found through a direct combinatorial analysis.
We distinguish the following cases:

(i) For $n-k \le x$ and $k \le x$, $S_n^{(k)}(x)$ is equal to the binomial coefficient $C_n^k$.

(ii) When $n-k \le x$ and $k > x$, all the unfavorable sequences (that is, the sequences of length n
that contain k positive residuals in which the length of the longest run of residuals having the same sign
exceeds x) will contain at least one run of consecutive positive residuals (and no run of consecutive
negative residuals) of length larger than x. In this particular case, we want to study the length of the
longest head run in n tosses of a biased coin including k heads, a problem solved by [9].

(iii) In a similar way, when $n-k > x$ and $k \le x$, the problem is the same, swapping heads and tails.

(iv) For fixed x and k, when $n-k > x$ and $k > x$, the key is to partition the set of favorable
sequences according to their beginning. Each sequence of length n that contains k positive residuals in
which the length of the longest run of residuals having the same sign does not exceed x can begin in at
most 2x different ways, and every beginning is followed by a sub-sequence with no more than x consecutive
residuals having the same sign. In Table 1, we introduce the notation for the number of favorable sequences
conditionally on the possible beginnings.
Table 1

The possible beginnings for a favorable sequence and the associated number of favorable sequences (and upper bounds)

Clearly, $S_n^{(k)}(x) = N_{1+-} + N_{2+-} + \ldots + N_{x+-} + N_{1-+} + \ldots + N_{x-+}$.
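
The partition by beginnings can be checked on a small case. The sketch below is a brute-force illustration with hypothetical helper names: it enumerates the favorable sequences, groups them by the sign and length of their first run, and confirms that the group sizes add up to the total and that at most 2x beginnings occur. It also checks that the number of favorable sequences starting with one positive then a negative residual is bounded by the number of favorable completions of length n − 2 with k − 1 positives.

```python
from collections import Counter
from itertools import groupby, product

def favorable(n, k, x):
    """All sign sequences (1 = positive) of length n with k positives and every run <= x."""
    return [s for s in product((0, 1), repeat=n)
            if sum(s) == k and max(len(list(run)) for _, run in groupby(s)) <= x]

def beginnings(n, k, x):
    """Count favorable sequences by (sign, length) of their first run."""
    counts = Counter()
    for s in favorable(n, k, x):
        sign, run = next(groupby(s))       # first maximal run of equal signs
        counts[(sign, len(list(run)))] += 1
    return counts

# Small case with n - k > x and k > x (values chosen only for illustration).
n, k, x = 9, 5, 3
groups = beginnings(n, k, x)
assert sum(groups.values()) == len(favorable(n, k, x))   # the beginnings partition the set
assert len(groups) <= 2 * x                               # at most 2x different beginnings
assert groups[(1, 1)] <= len(favorable(n - 2, k - 1, x))  # upper bound for the "+ -" beginning
```

The last assertion corresponds to the first step of the counting argument; the later steps refine this upper bound by alternately removing and restoring sequences.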

Let us determine the number of "favorable" sequences beginning with a positive residual followed by a negative
one, $N_{1+-}$.
Step 1:

$N_{1+-}$ is at most equal to $S_{n-2}^{(k-1)}(x)$ (i.e., the number of favorable ways to complete a sequence beginning
by +-, see Table 1).

Step 2:

Among these (n − 2)-sequences (the first two signs of the residuals are fixed), those beginning by x "-"
must be taken away because, in this case, the obtained sequences admit x + 1 consecutive "-" (see Table
2).

Position:  1   2   3 ... (x+2)   (x+3)
Sign:      +   -   - ... -       +

Table 2

Form of the sequences to "subtract" from the previous $S_{n-2}^{(k-1)}(x)$.

There are $S_{n-2-(x+1)}^{(k-2)}(x)$ of them. At this point, $N_{1+-}$ is at least equal to $S_{n-2}^{(k-1)}(x) - S_{n-2-(x+1)}^{(k-2)}(x)$.

Further steps:

Analogously, the (n − 2 − (x + 1))-sequences beginning by x "+" must be subtracted from the $S_{n-2-(x+1)}^{(k-2)}(x)$
sequences taken away previously (see Table 3).

Position:  1   2   3 ... (x+2)   (x+3)   (x+4) ... (2x+4)   (2x+5)
Sign:      +   -   - ... -       +       + ... +

Table 3

Form of the sequences to "add" to the previous $S_{n-2}^{(k-1)}(x) - S_{n-2-(x+1)}^{(k-2)}(x)$.
Recursively,

$$N_{1+-} = \sum_{j \ge 0} \Big( S_{n-2-2j(x+1)}^{(k-1-j(x+1))}(x) - S_{n-2-(2j+1)(x+1)}^{(k-2-j(x+1))}(x) \Big).$$

Note that for j large enough, indexes become negative. We use the following conventions: for all x,
$S_0^{(0)}(x) = 1$ and $\forall n \in \mathbb{N}^*, k \in \mathbb{N}^*$, $S_{-n}^{(k)}(x) = S_n^{(-k)}(x) = S_{-n}^{(-k)}(x) = 0$. We use the same method to
calculate the number N associated with every possible beginning, and we conclude the proof of Formula (2) by summing them.
There are some "special points" that need a correction when applying Formula (2). These points are
such that the quantity $S_x^{(0)}(x)$ appears in the formula when $k < \frac{n}{2}$ (or the quantity $S_x^{(x)}(x)$ when $k > \frac{n}{2}$).
We underline that when $k = \frac{n}{2}$, the quantities $S_x^{(0)}(x)$ and $S_x^{(x)}(x)$ do not appear in the formula.

For example, if $\exists\, (i,j) \in \{1, \ldots, x\} \times \mathbb{N}^*$ such that $(n,k) = (2j(x+1)+i,\ j(x+1))$, then the point
$S_x^{(0)}(x)$ appears in the term $-\sum_{i=1}^{x} S_{n-1-(2j+1)(x+1)-i}^{(k-(j+1)(x+1))}(x)$ of the recursive formula of $S_n^{(k)}(x)$. In this
case, $S_x^{(0)}(x)$ represents the number of sequences of length x with x negative residuals and zero positive
residuals that must be subtracted when the last residual before the x last ones is negative. So, this
sequence ($S_x^{(0)}(x) = 1$) must not be subtracted since it had not been counted before (because it would have
produced a run of (x + 1) consecutive negative residuals).
Note that in other cases, $S_x^{(0)}(x)$ has to be taken into account (for example if the (n − x)-sequence preceding
the x last residuals ends with a positive residual).

Similar considerations lead to the three other corrections.

Recursive formula (2) becomes clearer if we look at Table 4, which illustrates it.

In Table 4, for fixed n, k and x, we represent the coefficients to assign to each $S_{\tilde{n}}^{(\tilde{k})}(x)$ (where $\tilde{k} \le k$
and $\tilde{n} \le n$) in order to compute $S_n^{(k)}(x)$. The sign "+" means that such a coefficient equals 1, "−" that such a
coefficient equals −1, and an empty cell means that the coefficient equals 0.

Proof of Proposition 4.2: 

This proposition is a direct application of the fact that $\Pr(L_n^- < L_n^+)$ tends to 1 when n tends to infinity,
as shown in Muselli (2000), since, for all $x \ge 1$,

$$\Pr(L_n \le x) = \Pr(L_n \le x \mid L_n^- < L_n^+)\Pr(L_n^- < L_n^+) + \Pr(L_n \le x \mid L_n^- \ge L_n^+)\Pr(L_n^- \ge L_n^+)$$

and $L_n = \max(L_n^-, L_n^+)$.
Table 4

Illustration of Recursive Formula (2), for a fixed x.